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Abstract 

The pricing of American style and multiple exercise options is a very 
challenging problem in mathematical finance. One usually employs a 
Least-Square Monte Carlo approach (Longstaff-Schwartz method) for the 
evaluation of conditional expectations which arise in the Backward Dy- 
namic Programming principle for such optimal stopping or stochastic con- 
trol problems in a Markovian framework. Unfortunately, these Least- 
Square Monte Carlo approaches are rather slow and allow, due to the 
dependency structure in the Backward Dynamic Programming principle, 
no parallel implementation; whether on the Monte Carlo level nor on the 
time layer level of this problem. 

We therefore present in this paper a quantization method for the com- 
putation of the conditional expectations, that allows a straightforward 
parallelization on the Monte Carlo level. Moreover, we are able to de- 
velop for AR(l)-processes a further parallelization in the time domain, 
which makes use of faster memory structures and therefore maximizes 
parallel execution. 

Finally, we present numerical results for a CUDA implementation of 
this methods. It will turn out that such an implementation leads to an 
impressive speed-up compared to a serial CPU implementation. 

Keywords: Voronoi Quantization, Markov chain approximation, CUDA, Paral- 
lel computing for financial models, Stochastic control. 

1 Introduction 



The pricing of American style and multiple exercise options consists of solving 
the optimal stopping problem 

V ~ csssup<^ E((p^(A't-)| Jq) : t is a (J'fe)-stopping time \ 
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for an adapted stochastic process {Xk)o<k<n on a filtered probability space 
(ri, (J"fc)o<fc<n, IP) and obstacle functional ipk,0 < k < n. 

It is well known (see e.g. [13]) that V is given by the solution Vb to the 
Backward Dynamic Programming (BDP) Principle 

Vk = max(^t,(Xfe); E(Vi.+i | J-^)) , < fc < n - 1. 

We focus here on the case of an adapted Markov chain {Xk), so that it holds 
E(Vfc+i|J^fe) = E(Vfc+i|Xfc). Then the main difficulty of solving ([1]) by means of 
Monte Carlo methods lies in the approximation of the conditional expectations 
E(yfe+i|Xfc). This is usually accomplished by a Least Squares regression as 
proposed by the LongstafF- Schwartz method. Following [6l [11] and [15] the 
main steps of this procedure consists of 

• Simulating AI paths of {Xk) (forward step) 

• Starting at k = n — 1, approximate fk{x) ~ M{Vk+i\Xk = a;) by a Least 
Squares regression and proceed backwards to 0. (backward step) 

From a practical point of view, the most expensive tasks are clearly the 
repeated Least Square regressions on the huge number of Monte Carlo paths. 
Due to the sequential dependency structure of the Backward Dynamic Program- 
ming formula, the collection of the Least Squares problems as a whole cannot 
be solved in parallel, but has to be processed in strict sequence. Moreover, it is 
not an easy task to solve the single Least Square problems efficiently in parallel. 

We therefore present in this paper a Quantization Tree algorithm, which 
handles the most part of the work in a forward step which can be easily paral- 
lelized on the Monte Carlo level (pathwise) as well as on the time layer level. 
Therefore, this approach is well suited for the use of massive parallel computing 
devices like GPGPUs. Using this approach, the subsequent backward process- 
ing of the BDP principle becomes straightforward and negligible in terms of 
computational costs when compared to the Least Squares backward step. 

2 The Quantization Tree Algorithm 

The Quantization Tree algorithm is an efficient tool to establish a pathwise 
discretization of a discrete-time Markov chain (see e.g. [TJ [21|3] or [S]). Such a 
discretization can be used to solve optimal stopping or control problems, as they 
occur in the evaluation of financial derivatives with non- vanilla exercise rights. 
In this paper, we focus on a fast computation of the transition probabilities 
in a Quantization Tree by means of GPGPU-devices, which make this approach 
suitable for time-critical online computations. 

Therefore, let {Xk)o<k<n be a discrete-time L^-Markov chain on a filtered 
probability space (ri, (J^fc)o<fe<?i, P) with values in the vector space {R'^,B'^). 
This vector space shall be endowed with an appropriated norm (often Euclidean 
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norm). For each time-step k we furthermore assume to a have a quantization 
grid 

Tfc = {xi, . . . , Xff^) 

of size Nk- 

This means that Tk provides a discretization of the state space of the r.v. 
Xi^ , which is supposed to minimize the quadratic quantization error 

Eram\\X,-x^r (2) 

l<i<Nk 

over all possible grids Ffe C M"^ with size Ir^l < Nk- (See [5] for a comprehensive 
introduction to quantization of probability distributions.) 

For a grid F^, let (Ci(Ffe)) ^^^^^^ be a Voronoi Partition of M** induced by 
the points in F^, i.e. 

Q(Ffe) C {y e M"^ : \\y - x.'^H < ^ mni^Jly - ^'W}- 
Wc then call the mapping 



Z 

1=1 



the Nearest Neighbor projection of 2 onto F^ . 

This Nearest Neighbor projection defines in a natural way the Voronoi Quan- 
tization 



i=l 

which obviously provides a discrete r.v. with not more than states and 
E\\X,-Xl''r=E mm WX^-x'^f- 

l<i<Nk 

Defining the cartesian product quantizer 

n 

r = nr. 

we arrive at a path discretization of the Markov chain (Xk) with |F| < rifc=o 
paths, which we will call the Quantization Tree (see Figure [T]). 

To equip F with a probability distribution, we introduce the transition prob- 
abilities 

4=V{X^=x^\xl^^xt') 

= ¥{Xk G Q(Ffc) I e C,(Ffc_i)). 

If the marginal distributions of (Xk) are Gaussian and the norm is the canon- 
ical Euclidean norm, grids which minimize ^ arc precomputed and available 
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Figure 1: A Quantization Tree F 

at |13| . Otherwise, some sub-optimal grids, matching the first two moments of 
Xk, can be employed at the price of not achieving the full optimal convergence 
rate. 

Nevertheless, the true difficulties of this approach actually consist in the 
computations of the transition probabilities 7r*j . These probabilities are usually 
so strongly connected to the individual choice of the Markov chain (Xk) that 
they cannot be precomputcd like the above quantization grids or approximated 
by simple means. 

We therefore have to perform a Monte-Carlo (MC) simulation of the Markov 
chain (Xk) in order to estimate the transition probabilities 7r*j . Since these MC 
simulations can be quite time consuming, we will take advantage of the massive 
parallel computing capabilities of nowadays GPGPU-devices and reduce the com- 
putational time for the estimation of the transition probabilities to a level that 
actually is acceptable for time-critical applications in financial practice. 

As the Quantization Tree F exhibits a pathwise approximation of the Markov 
chain (Xk), we may numerically solve on F stochastic control or optimal stopping 
problems like they occur e.g. in the valuation of options with non-vanilla right 
exercises. 

In [5], the optimal stopping problem 

V ~ csssup^E(^(pT-{Xr)\J-o) : t a (J^fe)-stopping timc| (4) 

with a payoff function ipt{x) = (soCxp((r — a'^/2)t + (tx) — K^^ and (Xk) a 
d-dimensional time-discretizcd Brownian motion is solved to approximate Amer- 
ican option prices. 

In [3], the authors employ the Quantization Tree to solve the stochastic 
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control problem 



P{Q) = esssupjEf^ QkVkiXk 



Joj :Vfc = 0,...,n-l : 

71 — i N 

maxj ( 7 



fc=0 



where Vk can be interpreted as a payoff function and the couple Q = (Qmin, Qmax) 
provides some global constraints on the cumulated consumption X]fc=o 
that ([5]) yields the fair value of a swing option, which is an important derivative 
in energy trading. 

Concerning the Quantization Tree algorithm, note that T contains such a 
huge number of paths (e.g. at least 100"^^^ in the example below) that it is 
impossible to process above problems in a path-wise manner. 

Therefore, one usually resorts on the Backward Dynamic Programming (BDP) 
Principle^ which allows a time-layer wise proceeding. This approach yields a 
complexity of C^^^-^ Nk-iNk, i.e. increases only linearly in n. 

In case of the optimal stopping problem the true BDP-principle can be 
approximated by setting 

Vk = max(^t,(X['=); E(i4+i jX^*^)) , < fc < n - 1, 

so that the measurable r.v. Vq yields an approximation for V . Doing so we 
somehow "force" the Markov property of the Quantization sequence [X^''). 

In case of the stochastic control problem ([S]), it was shown in [3] that there 
exists a bang-bang control for ([5]), so that the BDP-principle leads to 

P„ =0 
PfclQ'^) =max{.Ti;fe(X[^) 

+ E(Pfe+l(x"-'-l(g^x))|x['=),x e {cijn/^-^-i}, 

where the set Iq and the function x^iQi^) ensure to keep consumption within 
the global constraints [Qmin, Qmax]- 

In both cases, we have to evaluate conditional expectations E(/(Xfe+i)|Xfc), 
which reduce on F to 

Concerning the approximation error for this approach, assume that the Vk 
are Lipschitz-continuous and that {Xk) has Lipschitz-Feller transition kernels. 
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We then get in case of a trivial cr-field Tq for a constant C > (see [3], Thm 3) 

1/2 

\PiQ)-Pom<CY.[E\\Xk-Xl''f) . 

fc=0 

3 Swing options in the Gaussian 2-factor model 

We will now focus on the implementation of the Quantization Tree algorithm for 
the valuation of Swing options in a Gaussian 2-factor model and present in detail 
the computation of the transition probabilities using CUDA on a GPGPU-device. 
In this model, the dynamics of the underlying are given as 

St = so exp (^aij^ e~'''^'-''>dW^ + e-^'^'^'-^Uw^ - ^fXt^ 

for Brownian Motions and with some correlation parameter p. 

Having introduced the time discretization tk = k/n, k = 0, . . . , n, we con- 
sider the 2-dimensional Ornstein-Uhlcnbeck process 

Xk=(^J e-'^'^'""'^ dW} , j e-'°'^^'''-'Uw'iy (6) 

This Markov chain admits a useful representation as a first-order auto- 
regressive (AR-l)-Y>T:ocess: 

Proposition 1 For {X^) from (0j it holds 

Xk+i = AkXk + Tkek, fc = 0,...,n-l, 

where andTk are deterministic matrices and {e^) is an i.i.d. standard normal 
sequence. 

In order to estimate the transition probabilities 

4 = F{Xk e Cj{Tk) I Xk-1 e a{Tk-i)) 
P{Xk-i e aiVk^i)) 

we will therefore simulate M samples of (Xk) according to Proposition [T] and 
perform in each time-layer k a Nearest Neighbor search to identify the Voronoi 
cell Cj{Tk) in which Xk falls. 

Using the additional counters p^j and p^, a serial implementation for the 
estimation of irf^ is given by Algorithm I. 

We will adopt a numerical scenario, which has already proven in [5] to pro- 
duce accurate results for the valuation of Swing options. Thus we set 

# MC-SampIes: M = 100.000 



6 



Algorithm I 



for TO = 1, . . . , M do 

# Initialization 

X <— xo, i ^ 0, ^ 1 

for /c = 1 , . . . , n do 

Simulate 

a; ^ AfcX + TfcEfc 

Find NN-Indcx j of a; in 

Set 

+=1 

+= 1 

end for 
end for 

Set 4 ^ ^, 1 < i, j < iVfc, 1 < fc < n. 



# Exercise days: n = 365 

Grid size: iV = A^t = 100 - 500 for fc = 1, . . . , n. 

This setting results in a computational time of 30—90 seconds for non-parallel 
estimation of the transition probabilities on a Intel Core i7 CPU@2.8GHz 
and N = 100 to 500. 

Since any parallel implementation of the above algorithm has to perform 
actually the following steps 

1. ) generation of the independent random numbers 

2. ) a Nearest Neighbor search 

3. ) updating the counters , 

we will discuss these tasks in more detail with respect to an implementation for 
CUD A. 

The amount of data which has to be processed in these steps when using 
single precision floating-point numbers is summarized in Table [TJ 



Table 1: Amount of data to be processed for A = 100 — 500. 





per layer k 


total 


^ Random numbers 


100k 


36.5M 


# Nearest Neighbor searches 


100k 


36.5M 


size of and 


40kB - 1MB 


15 - 365MB 


size of grids F^. 


SOOByte - 4kB 


285kB - 1.5MB 
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3.1 Random number generation 



The challenge of random number generation on parallel devices consists in mod- 
ifying the sequential random number generator algorithm in such a way, that 
the original sequence {x„, n = 1, . . . , M} with M = k ■ s 

• is generated in independent blocks of size s, i.e. k streams {xn-s+ij * = 
1, . . . , s}, where n = 0, . . . , fc — 1 (block approach) 

or 

• can be partitioned through a skip-ahead procedure, i.e. one generates 
independently s streams {xn+i-s, i = 0, . . . , fc — 1} for n = 1, . . . , s (skip- 
ahead) 

The block-approach can be accomplished by generating a well chosen se- 
quence of seed values to start the parallel computation of the random number 
streams. In contrast to this, for the skip-ahead approach we have to modify 
the main iteration of the random number generator itself . Nevertheless, this 
modification can be easily carried out for linear congruential random number 
generators 

Xn+i = aXn + c mod 2™, 
For this kind of generator it holds 

Xn+s = Axn + C mod 2" 

with A = and C = Yld=o "^'c. Thus, once the coefficients A and C are 
computed, the generation of the subsequence {xn+is, i G N} is as straightforward 
as it is for {x„, n G N}. 

As a first parallel random number generator, we have implemented a parallel 
version of drand48 in CUBA, which operates in 48bit arithmetic. 

A slightly more sophisticated variant of this random number generator is 
given by L'Ecuycr's Multiple Recursive Generator MRG32k3a (cf. [9]) 

xi = (1403580 a;,\_2 - 810728 x,\_3) mod mi 
xl = (527612 a;,^,_i - 1370589 a;,^,_3) mod TO2 
Xn = (a;,\ — xf^) mod mi 

for TOi = 2^2 - 209 and ma = 2^2 - 22853. 

Here, it is again possible to precomputc constants (matrices) to generate the 
skip-ahead sequence {xn+is-,i € N} efficiently (see |10|). An implementation in 
CUBA of this method is given by the CPU-Library of NAG. 

A third kind of random number generators for CUDA is given by Marsaglia's 
XORWOW generator in the CURAND-Library of Cuda Toolkit 3.2. As de- 
scribed in |12j one easily may compute starting seed values for a block approach 
and the random numbers sequence is then given by very small number of fast 
bit-shifts and XOR-operations. To be more precise the initialization procedure 
of the CURAND-Library computes starting values for the blocks which correspond 
to 2^^ iterations of the random engine. Moreover the main iteration of the 
random number generator for the state variables v , w , x , y , z reads 
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unsigned int curandO 
{ 

unsigned int t; 

t = ( X - (x » 2) ) ; 

X = y; 

y = z; 

z = w; 
w = v; 

V = (v ' (v « 4))~(t ~ (t « 1)); 
d += 362437; 
return v + d; 

> 

To illustrate the performance of these three random number generators we 
have chosen a Monte Carlo simulation with a very simple integrand to illustrate 
the performance in simulations where the function evaluation is very cheap. To 
be more precise, we estimated tt = 3.14159265... by a Monte Carlo simulation 
for iA^(i?;2(0, 1)) using M = 10^ random numbers. 

The results for a NVIDIA GTX 480 device and CUDA 3.2 are given in Table 
[21 The mean and the standard deviation of the MC-Estimator were computed 
from a sample of size 500. 



RNG engine 


computational time 


mean 


std. Dev. 


drand48 


0.2562 sec 


3.141590 


5.2585e-05 


MRG32k3a 


0.2573 sec 


3.141594 


5.20932e-05 


CURAND 


0.2085 sec 


3.141592 


5.03272e-05 



Table 2: Results for a Monte Carlo estimation of tt = 3.14159265... 



One recognizes that the XORWOW generator from the CURAND- library slightly 
outperforms the two linear congrucntial implementations, since the XORWOW- 
step can be processed more efficiently than a modulo operation. Nevertheless 
the differences between all three random number generator arc rather marginal. 

Especially, when we have in mind, that the original problem of swing op- 
tion pricing needs only 35M random numbers in total, the generation of this 
amount of random numbers becomes negligible compared to the time spent for 
the nearest neighbor searches. 

3.2 Nearest Neighbor search 

For each MC-realization Xk we have to perform a Nearest Neighbor search in 
every time-layer k to determine the Voronoi cell Cj{Tk) in which Xk falls. 

These Nearest Neighbor searches can be performed completely independent 
of each other, so we implemented them as sequential procedures and only have 
to pay attention to a proper adaption to the CUDA-compute capabilities. 
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Note here that we cannot employ the CUDA built-in texture fetch methods 
for this task, since the grids do in general not consist of a lattice of integer 
numbers. 

From an asymptotical point of view, the A:d-tree methods (cf [7]) obtain the 
fastest results for Nearest Neighbor searches of 0(log -/V)-time. Unfortunately, 
all these divide & conquer-type approaches heavily rely on recursive function 
calls; a programming principle which was introduced only very recently in the 
CUDA Compute Capability 2.x specification. Alternatively, one may implement 
a simple brute force Nearest Neighbor search of 0(A^)-time complexity. 

The results for 36. 5M NN Searches of a random number in a 2-dimensional 
grid can be found in Table [31 It is striking that the brute force approach 



N 


brute force 


fcd-trec 


100 


0.09 sec 


3.56 sec 


250 


0.23 sec 


5.14 sec 


500 


0.41 sec 


6.59 sec 



Tabic 3: Computational time for 36. 5M Nearest neighbor searches on a NVIDIA 
GTX 480 device 

outperforms the fcd-tree method in this setting by a huge factor, even though it 
suffers from a sub-optimal asymptotic behavior. 

Further analysis revealed that, when using the same random number for the 
search in all threads of a given block, the kd-tvee approach took in the same 
setting only 0.25 to 0.34 sec {N = 100 to 500). The dramatic slowdown of Table 
[3l where the NN Search is performed for different random numbers in each single 
thread, must be caused by a very inhomogeneous branching behavior of the 
single threads during the kd-tiee traversal, which prevents the GPGPU-scheduler 
of distributing the threads efficiently. 

We will therefore use in the sequel the brute force approach for the further 
numerical experiments. 

3.3 Updating p'^j 

As soon as we have determined the Voronoi cells Ci{Tk-i) and Cj(rfc) in which 
a realization of {Xk-i,Xk) falls, we have to increase the counter p^^. 

Since, in a parallel execution of steps [3Tl and 13.21 . it can happen that two 
threads try to update the same counter p'^j at the same time, we arrive at the 
classical situation of a race condition. 

Consequently, such a situation would lead to an undetermined result for the 
counter p^j , which practically means that we randomly lose parts of the Nearest 
Neighbor search results. 

To avoid this race condition, we arc forced to employ memory locks, which 
are implemented in CUDA by means of atomic operations. Hence, we have to 
increment p^j by calling the CUDA-function 

int atomicAddCint* address, int val) ; . 
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The resulting parallel procedure is stated as Algorithm II. 



Algorithm II 

for TO = 1, . . . , M do in parallel 

^ Initialization 

X -t^ Xo, i 0, p\ ^ I 

for fc = 1 , . . . , n do 

Simulate 

X ^ AkX + TkCk 

Find NN-Index j of x in Tk 

atomic increment 

atomic increment p'j^^ 

i ^ j 
end for 
end for in parallel 
Synchronize threads 

Set in parallel 7rf- ^ t-, I < i, j < Nk,l < k < n. 



4 Numerical results 

One of the key points in an efficient CUDA-implementation is the choice of the 
proper memory structure for the individual data. Table 21 lists the available 
memory types in CUDA Compute Capability 1.x. 

local memory not cached 16kB per thread 

constant memory cached 64kB per device 

shared memory n/a 16kB per block 

global memory not cached « 1GB per device 

Table 4: Memory types for CUDA compute capability 1.x 

Note that shared memory is (beneath the processor registers) the fastest 
memory available in CUDA, since it resides very close to the processor cores. 
There are 16kB of shared memory available per Multiprocessor, whose content 
is read- and writable by any thread in the same block of a grid. 

The other memory types in Table |4] are about 400 times slower than shared 
memory except constant memory which is cached and therefore achieves a sim- 
ilar read performance as shared memory. 

Taking into account the sizes of the arrays T^ijjPij and Tk from Table [TJ there 
is no other possibility for the above algorithm than to place all the arrays in 
global memory, since any thread has to access the arrays 7r*j , pfj and for any 
k, 1 < k < n. 
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The fact that these arrays have to reside in global memory especially slows 
down the Nearest Neighbor searches, which rely on a fast access to the grid 
points of Tk ■ 

We therefore present another approach, which maximizes the parallel execu- 
tion by splitting up the problem into smaller parts, that can make use of faster 
memory. 

Note that due to Proposition[T]we can directly simulate the couple {Xk,ek) in 
order to get a realization of {Xk, ^fc+i) without the need of generating Xi, I < k. 

Thus, if we accept to generate twice the amount of random numbers and 
double the number of Nearest Neighbor searches, we arrive at Algorithm 111. 

Algorithm III 

for k = 1, . . . ,n do in parallel 
for m = 1, . . . , M do in parallel 

Simulate Xk , e/c 

Find NN-Index i of Xk in Tk 

Find NN-Index j of AkXk + Tkeu in Ffe+i 

atomic increment p^^ 
atomic increment 
end for in parallel 

Synchronize 

Set in parallel TT*;,- ^ ^, 1 < i,j < iVfc 
end for in parallel 



Here, we do not only parallelize with respect to the MC-samples (pathwise), 
but also with respect to the time-layer k. Therefore, wc arc able to perform the 
whole MC-simulation of a given time-layer k (i.e. the inner loop) on a single 
Multiprocessor (i.e. within a single block in CUDA-terminology). 

Hence, we can store the involved grids Vk and Ffc+i entirely in shared memory 
and benefit from a huge performance gain. 

This can be seen in Table [5] and Figure [2l which demonstrates that the shared 
memory implementation - performing even twice as many Nearest Neighbor 
searches - is still significantly faster than the usual pathwise parallelization for 
CUBA Compute Capability 1.x. 



N 


100 


250 


500 


Algorithm II 


0.82 sec 


1.25 sec 


1.83 sec 


Algorithm HI 


0.31 sec 


0.68 sec 


1.38 sec 



Table 5: Computational times for the transition probabilities on a NVIDIA GTX 
295 device 

All the computations for CUDA Compute Capability 1.x were performed on 
a NVIDIA GTX 295 GPGPU, CUDA Toolkit 2.3 and NVIDIA X-Drivcr 190.53 
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0.4 - 



0.2 I 1 ' 1 ' 1 ' 1 ' 1 1 

100 150 200 250 300 350 400 450 500 

N 

Figure 2: Linear performance of Algorithms II & III with respect to N on a 
NVIDIA GTX 295 device 

for 64bit Linux. The running times in Table [5] also include the transfer of the 
transition probabilities Tr^j- back to the host CPU. 

Furthermore, we have chosen in all examples 256 - 512 threads per block and 
overall 365 - 400 blocks. This choice was optimal for our setting. Note here, 
that the shared memory algorithm performs 73 • 10^ Nearest Neighbor searches 
in a 2-dimensional grid. Assuming that the brute force Nearest Neighbor search 

(min < {xi - + {x2 - 1/2)^) 

for each grid point is equivalent to 6 FP-operations (3 additions, 2 multiplica- 
tions, 1 comparison), we already arrive for N = 500 at a computing power of 
approx. 175 GFLOPS only for the Nearest Neighbor searches (the pure kernel 
execution takes in this case 1.25sec). Compared to the peak performance of 
895 GFLOPS for one unit in the NVIDIA GTX 295-device, this fact underlines 
that our implementation exploits a great amount of the theoretically available 
computing power of a GPGPU-devices. 

4.1 Progress in hardware: the Fermi-architecture 

With the arrival of CUDA Compute Capability 2.x and the Fermi-architecture, 
there are now LI- and L2 caches available of up to 48kB per block. It turned out 
that this change in hardware design has strong implications on the performance 
of Algorithm [TTl As it can be seen in Table [S] and Figure [31 the new cache 
can nearly completely compensate the advantage of the shared memory usage 
in Algorithm IIIII Moreover, both parallelizations differ roughly by a factor of 
two which is caused by the fact that algorithm IIIII has to perform twice the 
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N 


100 


250 


500 


Algorithm II 


0.11 sec 


0.30 sec 


0.63 sec 


Algorithm III 


0.21 sec 


0.50 sec 


0.99 sec 



Table 6: Computational times for the transition probabilities on a NVIDIA GTX 
480 device 



Algorithm II .''h — 
Algorithm lit''— x— 

x' 



0.8 - 




Figure 3: Performance of Algorithms II & III with respect to on a NVIDIA 
GTX 480 device 

number of Nearest Neighbor searches than Algorithm HIl The computations for 
CUDA Compute Capability 2.x were performed on a NVIDIA GTX 480 GPGPU, 
CUDA Toolkit 3.2 and NVIDIA X-Driver 260.19.29 for 64bit Linux 

5 Conclusion 

We have shown in this paper that the use of GPGPU-devices is quite efficient for 
the estimation of transition probabilities in a Quantization Tree. Although we 
resorted for the Nearest Neighbor search, which is the most compute intensive 
part of the algorithm, to the sub-optimal brute-force approach, we could achieve 
by means of the massive computing power of a GPGPU-device a speed-up of factor 
200 compared to a serial CPU implementation. Those implementations can 
therefore be used for online estimation of the transition probabilities in time- 
critical applications in practice, which is not possible for a CPU implementation 
that can take more than 1 min for the same task. 
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